In this assignment, I'll explore spatial trends in evictions in Philadelphia using data from the Eviction Lab, alongside building code violations using data from OpenDataPhilly.
I'll be exploring the idea that evictions can occur as retaliation against renters for reporting code violations. Spatial correlations between evictions and code violations from the City's Licenses and Inspections department can offer some insight into this question.
A couple of interesting background readings:
The Eviction Lab built the first national database of evictions. For more information on the project, you can explore their website: https://evictionlab.org/. Understanding the eviction data can help inform solutions that promote housing rights, social equity, and better resource allocation.
Loading the eviction data with geopandas

The first step is to read the eviction data by census tract using geopandas. The data for all of Pennsylvania by census tract can be downloaded in GeoJSON format using the following URL:
https://eviction-lab-data-downloads.s3.amazonaws.com/PA/tracts.geojson
A browser-friendly version of the data is available here: https://data-downloads.evictionlab.org/
import numpy as np
import pandas as pd
import geopandas as gpd
import hvplot.pandas
import holoviews as hv
import cartopy.crs as ccrs
from matplotlib import pyplot as plt
%matplotlib inline
tracts = gpd.read_file('https://eviction-lab-data-downloads.s3.amazonaws.com/PA/tracts.geojson')
tracts.head()
print(len(tracts))
We will need to trim the data to Philadelphia only. Take a look at the data dictionary for the descriptions of the various columns: https://eviction-lab-data-downloads.s3.amazonaws.com/DATA_DICTIONARY.txt
Note: the column names are shortened — see the end of the above file for the abbreviations. The numbers at the end of the columns indicate the years. For example, e-16 is the number of evictions in 2016.
Take a look at the individual columns and trim to census tracts in Philadelphia. (Hint: Philadelphia is both a city and a county).
tracts_trim = tracts.loc[tracts['pl'] == 'Philadelphia County, Pennsylvania']
tracts_trim.head()
print(len(tracts_trim))
tracts_trim.crs #check crs
For this assignment, we are interested in the number of evictions by census tract for various years. Right now, each year has its own column, so it will be easiest to transform to a tidy format.
Use the pd.melt() function to transform the eviction data into tidy format, using the number of evictions from 2003 to 2016.
The tidy data frame should have four columns: GEOID, geometry, a column holding the number of evictions, and a column telling you what the name of the original column was for that value.
tracts_trim_melt = pd.melt(
tracts_trim,
id_vars=['GEOID','geometry'],
value_vars=['e-{:02d}'.format(x) for x in range(3, 17)],
value_name='number of evictions',
var_name='year'
)
#convert the year column to integers (replace the 'e-' prefix with '20', then cast)
tracts_trim_melt['year'] = pd.to_numeric(tracts_trim_melt['year'].replace('e-', '20', regex=True))
tracts_trim_melt.head()
#pd.melt returns a plain DataFrame, so rebuild a GeoDataFrame and assign the CRS
tracts_trim_melt = gpd.GeoDataFrame(tracts_trim_melt, geometry='geometry', crs='EPSG:4326')
#check data type
type(tracts_trim_melt['year'][0])
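The melt-and-rename pattern can be sanity-checked on a small toy frame. The column names below mimic the Eviction Lab abbreviations, but the values are made up:

```python
import pandas as pd

# toy wide-format frame mimicking the Eviction Lab layout (made-up values)
toy = pd.DataFrame({
    'GEOID': ['42101000100', '42101000200'],
    'e-03': [5, 2],
    'e-16': [8, 4],
})

# wide -> tidy: one row per (tract, year) pair
tidy = pd.melt(
    toy,
    id_vars=['GEOID'],
    value_vars=['e-03', 'e-16'],
    var_name='year',
    value_name='number of evictions',
)

# 'e-03' -> 2003, 'e-16' -> 2016
tidy['year'] = pd.to_numeric(tidy['year'].str.replace('e-', '20', regex=True))
print(tidy)
```

Each original column becomes a group of rows, so 2 tracts × 2 year columns yields 4 tidy rows.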
Use hvplot to plot the total number of evictions from 2003 to 2016. First, I performed a groupby operation to sum the total number of evictions across all census tracts, and then used hvplot() to make the plot.
groupbyyear = tracts_trim_melt.groupby('year')['number of evictions'].sum()
groupbyyear.head()
trendplot = groupbyyear.hvplot(kind='line')
trendplot
Looking at the number of evictions over time, there are highs and lows in eviction volume, but overall the volume stays at a relatively constant level. There might be a cyclical pattern in the total number of evictions, which could relate to economic activity or other external factors.
Our tidy data frame is still a GeoDataFrame with a geometry column, so I can visualize the number of evictions for all census tracts.
Use hvplot() to generate a choropleth showing the number of evictions for a specified year, with a widget dropdown to select a given year (or variable name, e.g., e-16, e-15, etc).
by_year = tracts_trim_melt.hvplot(c='number of evictions',
groupby='year',
width=600,
height=550,
geo=True,
dynamic=False).options(cmap='Magma')
by_year
High eviction counts tend to occur in West Philly, North Philly, and Northeast Philly. Tracts in South Philly appear to have lower eviction numbers compared to other tracts in the city.
Next, we'll explore data for code violations from the Licenses and Inspections Department of Philadelphia to look for potential correlations with the number of evictions.
L+I violation data for the years 2012 through 2016 (inclusive) is provided in CSV format in the "data/" folder.
Load the data using pandas and convert to a GeoDataFrame.
violation = pd.read_csv("./data/li_violations.csv")
violation.head()
#drop rows with missing coordinates
violation = violation.dropna(subset=['lat', 'lng'])
#build Point geometries from the lng/lat pairs and assign the WGS84 CRS
from shapely.geometry import Point
violation['Coordinates'] = [Point(xy) for xy in zip(violation['lng'], violation['lat'])]
violation = gpd.GeoDataFrame(violation, geometry="Coordinates", crs="EPSG:4326")
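A more concise alternative, assuming a reasonably recent geopandas, is gpd.points_from_xy, which builds the geometry column in one vectorized call. The coordinates below are made up (near Philadelphia):

```python
import pandas as pd
import geopandas as gpd

# toy lat/lng table (made-up coordinates near Philadelphia)
df = pd.DataFrame({'lat': [39.95, 40.00], 'lng': [-75.16, -75.13]})

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['lng'], df['lat']),  # note: x = lng, y = lat
    crs='EPSG:4326',
)
print(gdf.geometry.iloc[0])
```

The easy mistake here is the argument order: points_from_xy takes x (longitude) first, then y (latitude).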
There are many different types of code violations (running unique() on the violationdescription column will list them all, and nunique() gives the count). More information on the different types of violations can be found on the City's website.
Below, I've selected 15 types of violations that deal with property maintenance and licensing issues. We'll focus on these violations. The goal is to see if these kinds of violations are correlated spatially with the number of evictions in a given area.
violation_types = ['INT-PLMBG MAINT FIXTURES-RES',
'INT S-CEILING REPAIR/MAINT SAN',
'PLUMBING SYSTEMS-GENERAL',
'CO DETECTOR NEEDED',
'INTERIOR SURFACES',
'EXT S-ROOF REPAIR',
'ELEC-RECEPTABLE DEFECTIVE-RES',
'INT S-FLOOR REPAIR',
'DRAINAGE-MAIN DRAIN REPAIR-RES',
'DRAINAGE-DOWNSPOUT REPR/REPLC',
'LIGHT FIXTURE DEFECTIVE-RES',
'LICENSE-RES SFD/2FD',
'ELECTRICAL -HAZARD',
'VACANT PROPERTIES-GENERAL',
'INT-PLMBG FIXTURES-RES']
trim_violation = violation[violation['violationdescription'].isin(violation_types)]
trim_violation
The code violation data is point data. We can get a quick look at the geographic distribution using matplotlib and the hexbin() function. Make a hex bin map of the code violations and overlay the census tract outlines.
#reproject to Web Mercator; note that simply re-assigning the CRS would only relabel the coordinates, not transform them
crs_wm = 'EPSG:3857'
tracts_trim_melt = gpd.GeoDataFrame(tracts_trim_melt, geometry='geometry', crs='EPSG:4326').to_crs(crs_wm)
trim_violation = trim_violation.to_crs(crs_wm)
fig, ax = plt.subplots(figsize=(12.5, 11))
vals = ax.hexbin(trim_violation.geometry.x, trim_violation.geometry.y, gridsize=80, cmap='magma', mincnt=1)
# mincnt=1 leaves bins with no points uncolored
# add the tract geometry boundaries
tracts_trim_melt.plot(ax=ax, facecolor="none", edgecolor="grey", linewidth=0.2)
# add a colorbar and format
plt.colorbar(vals)
ax.set_axis_off()
ax.set_aspect("auto")
Visually, the code violation points do coincide with the clusters in the eviction data.
To do a census-tract-level comparison with the eviction data, I need to find which census tract each code violation falls into. The geopandas.sjoin() function does just that.
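On a toy example (two made-up square "tracts" and three made-up points), a spatial join with the 'within' predicate attaches each point to the polygon that contains it:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# two made-up unit-square "tracts"
tracts_toy = gpd.GeoDataFrame(
    {'GEOID': ['A', 'B']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs='EPSG:4326',
)

# three made-up violation points
points_toy = gpd.GeoDataFrame(
    {'id': [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(0.2, 0.8)],
    crs='EPSG:4326',
)

# left join: each point keeps its row and gains the GEOID of the tract it falls within
joined_toy = gpd.sjoin(points_toy, tracts_toy, predicate='within', how='left')
print(joined_toy[['id', 'GEOID']])
```

With how='left', points that fall inside no polygon would still survive the join, just with a missing GEOID, which is why the real join below should be checked for NaNs before counting.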
#check if the crs of the two data frame is the same
trim_violation.crs
tracts_trim_melt.crs
#drop the eviction rows, only keep the tracts info
tracts_geo = tracts_trim_melt.drop(['year','number of evictions'],axis = 1)
#drop duplicate rows, keep the unique tracts
tracts_geo_u = tracts_geo.drop_duplicates(subset = 'GEOID', keep = 'first', inplace = False)
tracts_geo_u.head()
#spatial join
joined = gpd.sjoin(trim_violation, tracts_geo_u, predicate='within', how='left')
joined
Next, we'll want to find the number of violations (of each kind) per census tract, by grouping the data frame by violation type and census tract.
The result of this step should be a data frame with three columns: violationdescription, GEOID, and N, where N is the number of violations of that kind in the specified census tract.
# count violations per (type, tract); the unstack/stack round-trip fills in zeros for combinations that never occur
joined_sum = joined.groupby(['violationdescription','GEOID']).size().unstack(fill_value=0).stack().reset_index(name='N')
joined_sum
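The unstack/stack round-trip deserves a word: a plain groupby().size() silently omits (type, tract) pairs with zero violations, while pivoting to a wide table with fill_value=0 and stacking back reinstates them. A toy illustration with made-up labels:

```python
import pandas as pd

# toy joined frame: tract 'Y' has no 'LICENSE' violations
df = pd.DataFrame({
    'violationdescription': ['PLUMBING', 'PLUMBING', 'LICENSE'],
    'GEOID': ['X', 'Y', 'X'],
})

# plain groupby.size() would yield only 3 rows, omitting ('LICENSE', 'Y');
# unstack(fill_value=0).stack() reinstates it with N = 0
counts = (df.groupby(['violationdescription', 'GEOID'])
            .size()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='N'))
print(counts)
```

The explicit zeros matter later: without them, tracts with no violations of a given type would simply be missing from the choropleth rather than drawn with a zero value.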
We now have the number of violations of each type per census tract as a regular DataFrame. We can now merge it with the census tract geometries (from the eviction data GeoDataFrame) to create a GeoDataFrame.
violation_join = tracts_geo_u.merge(joined_sum, on='GEOID')
violation_join
Now, we can use hvplot() to create an interactive choropleth for each violation type and add a widget to specify different violation types.
violation_plot = violation_join.hvplot(c='N',
groupby='violationdescription',
width=600,
height=550,
geo=True,
dynamic=False).options(cmap='Magma')
violation_plot
From the interactive maps of evictions and violations, I found a lot of spatial overlap.
As a final step, I'll make a side-by-side comparison to better show the spatial correlations. This will involve a few steps:
1. Trim the eviction data to a single year (2016).
2. Trim the violation data to a single violation type of interest.
3. Use hvplot() to make two interactive choropleth maps, one for the data from step 1 and one for the data from step 2.
#trim eviction data to include only 2016 data
eviction_2016 = tracts_trim_melt[tracts_trim_melt['year']==2016]
eviction_2016
#choose 'LICENSE-RES SFD/2FD' as the violation type of interest
license_res = violation_join[violation_join['violationdescription']=='LICENSE-RES SFD/2FD']
license_res
eviction_2016_plot = eviction_2016.hvplot(c='number of evictions',
width=490,
height=470,
geo=True,
dynamic=False).options(cmap='Magma').relabel('Evictions in 2016')
license_res_plot = license_res.hvplot(c='N',
width=490,
height=470,
geo=True,
dynamic=False).options(cmap='Magma').relabel('LICENSE-RES SFD/2FD Violations')
combined = eviction_2016_plot+license_res_plot
combined
Visually, the strongest overlap between the two maps appears in Northeast Philly and West Philly.
Identify the 20 most common types of violations within the time period of 2012 to 2016 and create a set of interactive choropleths similar to what was done in section 2.7.
Use this set of maps to identify 3 types of violations that don't seem to have much spatial overlap with the number of evictions in the City.
#reproject the full violation data to Web Mercator; to_crs transforms the coordinates rather than just relabeling the CRS
violation = violation.to_crs('EPSG:3857')
violation.crs
#group and count the number of violations of each type
violation_type = violation.groupby('violationdescription').size()
violation_type.head()
#rank and find the top 20 most frequent violation types
violation_type = violation_type.sort_values(ascending=False)
violation_top20 = violation_type.iloc[:20].reset_index()
violation_top20
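For a single column, the groupby().size() plus sort_values() pattern above is equivalent to value_counts(), which returns the counts already sorted in descending order. A toy check with made-up labels:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])

# groupby-and-sort, as above
by_group = s.groupby(s).size().sort_values(ascending=False)

# the one-step equivalent
by_vc = s.value_counts()

print(by_vc.head(2))  # the two most frequent values
```

Either form works; value_counts() just saves the explicit sort when ranking the top-N categories.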
violation_top20_f = violation[violation['violationdescription'].isin(violation_top20['violationdescription'])]
#join back with the tracts data
joined2 = gpd.sjoin(violation_top20_f, tracts_geo_u, predicate='within', how='left')
joined2 = joined2.groupby(['violationdescription','GEOID']).size().unstack(fill_value=0).stack().reset_index(name='N')
violation_join2 = tracts_geo_u.merge(joined2, on='GEOID')
violation_join2.head()
violation_join2_plot = violation_join2.hvplot(c='N',
groupby='violationdescription',
width=600,
height=550,
geo=True,
dynamic=False).options(cmap='Magma')
violation_join2_plot
The three violation types that don't seem to have much spatial overlap with the number of evictions in the city are:
Through these kinds of analyses, we gain a better understanding of the spatial patterns in the eviction data and the code violation data, as well as the correlation between the two.